Landuse clustering

Download this project.


Spectral Clustering Example

The image loaded here is a cropped portion of a LANDSAT image of Walker Lake.

In addition to dask-ml, we'll use rasterio to read the data and matplotlib to plot the figures. I'm just working on my laptop, so we could use either the threaded or distributed scheduler, but here I'll use the distributed scheduler for the diagnostics.

In [1]:
import rasterio
import numpy as np
import xarray as xr
import holoviews as hv
from holoviews import opts
from holoviews.operation.datashader import regrid
import cartopy.crs as ccrs
import dask.array as da
#from dask_ml.cluster import SpectralClustering
from dask.distributed import Client
hv.extension('bokeh')
In [2]:
import dask_ml
In [3]:
dask_ml.__version__
Out[3]:
'1.2.0'
In [4]:
from dask_ml.cluster import SpectralClustering
In [5]:
client = Client(processes=False)
#client = Client(n_workers=8, threads_per_worker=1)
client
Out[5]:

Client

Cluster

  • Workers: 1
  • Cores: 8
  • Memory: 17.18 GB
In [6]:
import intake
cat = intake.open_catalog('./catalog.yml')
list(cat)
Out[6]:
['landsat_5']
In [7]:
landsat_5_img = cat.landsat_5.read_chunked()
landsat_5_img
Out[7]:
Show/Hide data repr Show/Hide attributes
xarray.DataArray
  • band: 6
  • y: 7241
  • x: 7961
  • dask.array<chunksize=(1, 256, 256), meta=np.ndarray>
    Array Chunk
    Bytes 691.75 MB 131.07 kB
    Shape (6, 7241, 7961) (1, 256, 256)
    Count 11142 Tasks 5568 Chunks
    Type int16 numpy.ndarray
    7961 7241 6
    • x
      (x)
      float64
      2.424e+05 2.424e+05 ... 4.812e+05
      array([242400., 242430., 242460., ..., 481140., 481170., 481200.])
    • y
      (y)
      float64
      4.414e+06 4.414e+06 ... 4.197e+06
      array([4414200., 4414170., 4414140., ..., 4197060., 4197030., 4197000.])
    • band
      (band)
      int64
      1 2 3 4 5 7
      array([1, 2, 3, 4, 5, 7])
  • transform :
    (30.0, 0.0, 242385.0, 0.0, -30.0, 4414215.0)
    crs :
    +init=epsg:32611
    res :
    (30.0, 30.0)
    is_tiled :
    0
    nodatavals :
    (-9999.0,)
    scales :
    (1.0,)
    offsets :
    (0.0,)
    descriptions :
    ('band 1 surface reflectance',)
    AREA_OR_POINT :
    Area
    Band_1 :
    band 1 surface reflectance
In [8]:
crs=ccrs.epsg(32611)
In [9]:
x_center, y_center = crs.transform_point(-118.7081, 38.6942, ccrs.PlateCarree())
buffer = 1.7e4

xmin = x_center - buffer
xmax = x_center + buffer
ymin = y_center - buffer
ymax = y_center + buffer

ROI = landsat_5_img.sel(x=slice(xmin, xmax), y=slice(ymax, ymin))
ROI = ROI.where(ROI > ROI.nodatavals[0])
In [10]:
bands = ROI.astype(float)
bands = (bands - bands.mean()) / bands.std()
bands
Out[10]:
Show/Hide data repr Show/Hide attributes
xarray.DataArray
  • band: 6
  • y: 1134
  • x: 1133
  • dask.array<chunksize=(1, 74, 3), meta=np.ndarray>
    Array Chunk
    Bytes 61.67 MB 524.29 kB
    Shape (6, 1134, 1133) (1, 256, 256)
    Count 12945 Tasks 216 Chunks
    Type float64 numpy.ndarray
    1133 1134 6
    • x
      (x)
      float64
      3.345e+05 3.345e+05 ... 3.684e+05
      array([334470., 334500., 334530., ..., 368370., 368400., 368430.])
    • y
      (y)
      float64
      4.301e+06 4.301e+06 ... 4.267e+06
      array([4301220., 4301190., 4301160., ..., 4267290., 4267260., 4267230.])
    • band
      (band)
      int64
      1 2 3 4 5 7
      array([1, 2, 3, 4, 5, 7])
In [11]:
opts.defaults(
    opts.Image(invert_yaxis=True, width=250, height=250, tools=['hover'], cmap='viridis'))
In [12]:
hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[:3]])
Out[12]:
In [13]:
flat_input = bands.stack(z=('y', 'x'))
flat_input
Out[13]:
Show/Hide data repr Show/Hide attributes
xarray.DataArray
  • band: 6
  • z: 1284822
  • dask.array<chunksize=(1, 13596), meta=np.ndarray>
    Array Chunk
    Bytes 61.67 MB 416.94 kB
    Shape (6, 1284822) (1, 52118)
    Count 13629 Tasks 216 Chunks
    Type float64 numpy.ndarray
    1284822 6
    • band
      (band)
      int64
      1 2 3 4 5 7
      array([1, 2, 3, 4, 5, 7])
    • z
      (z)
      MultiIndex
      (y, x)
      array([(4301220.0, 334470.0), (4301220.0, 334500.0), (4301220.0, 334530.0),
             ..., (4267230.0, 368370.0), (4267230.0, 368400.0),
             (4267230.0, 368430.0)], dtype=object)
    • y
      (z)
      float64
      4.301e+06 4.301e+06 ... 4.267e+06
      array([4301220., 4301220., 4301220., ..., 4267230., 4267230., 4267230.])
    • x
      (z)
      float64
      3.345e+05 3.345e+05 ... 3.684e+05
      array([334470., 334500., 334530., ..., 368370., 368400., 368430.])
In [14]:
flat_input.shape
Out[14]:
(6, 1284822)

We'll reshape the image to be how dask-ml / scikit-learn expect it: (n_samples, n_features) where n_features is 1 in this case. Then we'll persist that in memory. We still have a small dataset at this point. The large dataset, which dask helps us manage, is the intermediate n_samples x n_samples array that spectral clustering operates on. For our 2,500 x 2,500 pixel subset, that's ~50

In [15]:
X = flat_input.values.astype('float').T
X.shape
Out[15]:
(1284822, 6)
In [16]:
X = da.from_array(X, chunks=100_000)
X = client.persist(X)

And we'll fit the estimator.

In [17]:
clf = SpectralClustering(n_clusters=4, random_state=0,
                         gamma=None,
                         kmeans_params={'init_max_iter': 5},
                         persist_embedding=True)
In [18]:
%time clf.fit(X)
CPU times: user 54.6 s, sys: 12.9 s, total: 1min 7s
Wall time: 27.6 s
Out[18]:
SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
                   eigen_solver=None, eigen_tol=0.0, gamma=None,
                   kernel_params=None, kmeans_params={'init_max_iter': 5},
                   n_clusters=4, n_components=100, n_init=10, n_jobs=1,
                   n_neighbors=10, persist_embedding=True, random_state=0)
In [19]:
labels = clf.assign_labels_.labels_.compute()
labels.shape
Out[19]:
(1284822,)
In [20]:
labels = labels.reshape(bands[0].shape)
In [21]:
hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands]) 
Out[21]:
In [22]:
hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[3:]])
Out[22]:
In [23]:
hv.Image(labels)
Out[23]:
This web page was generated from a Jupyter notebook and not all interactivity will work on this website. Right click to download and run locally for full Python-backed interactivity.

Download this project.